Week 5 of 12 · Part A — Applied Safety

How Jailbreaks Work (Defensively)

Why a safety-trained model can still be steered off its rails — and what that tells a defender

Day 21 of 60 · ~60 minutes · Concept

The uncomfortable premise of the week

For four weeks you've built the machinery of behaving-well-by-default: taxonomies, policies, red-teams, evals. This week confronts the thing that machinery can't fully fix. A model can be helpful, well-policied, and safety-tuned — and still collapse under inputs engineered to break it. That's the week's thesis: capability is not robustness, and a behavior the model exhibits 99.9% of the time can be reliably defeated by an adversary who only needs it to fail once.
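To put a rough number on "only needs it to fail once", here is a minimal sketch. It assumes every adversarial attempt is an independent trial with the same failure probability, which is a simplification: real attempts are correlated, and an adaptive attacker will usually do better than this model suggests. The function name and the attempt counts are illustrative, not part of the course material.

```python
def p_at_least_one_failure(per_attempt_failure: float, attempts: int) -> float:
    """Chance that the safeguard fails at least once across `attempts` tries.

    Assumption: tries are independent and share one failure probability.
    """
    if not 0.0 <= per_attempt_failure <= 1.0:
        raise ValueError("per_attempt_failure must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # P(no failure in n tries) = (1 - p)^n, so P(at least one) = 1 - (1 - p)^n.
    return 1.0 - (1.0 - per_attempt_failure) ** attempts


if __name__ == "__main__":
    # Behaving well 99.9% of the time means a 0.1% failure rate per attempt.
    for n in (1, 100, 1_000, 10_000):
        print(f"{n:>6} attempts -> {p_at_least_one_failure(0.001, n):.3f}")
```

Under these assumptions, a thousand attempts already give the attacker roughly a 63% chance of at least one success, which is why a high average-case pass rate says little about worst-case robustness.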

The thesis

Safety training shifts a model's default behavior; it does not install a guarantee. A jailbreak is any input that moves the model off that default into disallowed territory. We study how they work only to defend — to know which layer catches which attack, where the gaps are, and why no single safeguard is enough.

A note on framing before we go further: everything below is taught at the mechanism level. You will not find working strings, suffixes, or recipes here, and you don't need them. A defender's edge comes from understanding why a class of attack works, not from being able to run one.

Why safety training is brittle, conceptually

The most useful mental model comes from Jailbroken: How Does LLM Safety Training Fail? (Wei et al., 2023). It hypothesizes two failure modes of safety training that, between them, explain almost every jailbreak you'll ever see.

Core Theory

1 · Competing objectives

The model is trained to be helpful and to be harmless at the same time. An attacker constructs a context where those two goals collide — where being helpful, following instructions, or staying in character pulls against the refusal. When the helpful objective wins the tug-of-war, you get a jailbreak. The safety behavior wasn't deleted; it was out-pulled.

2 · Mismatched generalization

Safety tuning generalizes over the distribution of inputs it was trained on. Push the input far enough outside that distribution — unusual encodings, rare framings, formats the safety data never covered — and the harmlessness behavior simply doesn't fire, because the model doesn't recognize the situation as the kind it was taught to refuse.

Read this back to yourself

Competing objectives = the safety behavior is present but overpowered. Mismatched generalization = the safety behavior never activates because the input looks unfamiliar. Almost every jailbreak family is one of these two, or both. Naming which one you're looking at is the first move of a defender.
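The labeling habit described above can be captured in a small data structure. This is a sketch of one possible triage schema, not something taken from Wei et al.: the class names, the two yes/no questions, and the `diagnose` helper are all hypothetical, and each finding is recorded at the mechanism level only, with no payload text.

```python
from dataclasses import dataclass
from enum import Flag, auto


class FailureMode(Flag):
    COMPETING_OBJECTIVES = auto()       # safety behavior present but out-pulled
    MISMATCHED_GENERALIZATION = auto()  # safety behavior never activated


@dataclass(frozen=True)
class Finding:
    """One red-team finding, summarized without reproducing the input."""
    summary: str
    # Hypothetical triage question 1: did the context set helpfulness,
    # instruction-following, or staying in character against refusing?
    pressure_against_refusal: bool
    # Hypothetical triage question 2: did the input look unlike anything
    # safety training plausibly covered (encoding, framing, format)?
    outside_safety_distribution: bool


def diagnose(finding: Finding) -> FailureMode:
    """Map the two triage answers onto the two failure modes (or both)."""
    mode = FailureMode(0)  # empty flag: not yet explained by either mode
    if finding.pressure_against_refusal:
        mode |= FailureMode.COMPETING_OBJECTIVES
    if finding.outside_safety_distribution:
        mode |= FailureMode.MISMATCHED_GENERALIZATION
    return mode


if __name__ == "__main__":
    example = Finding(
        summary="role-play framing combined with an unusual encoding",
        pressure_against_refusal=True,
        outside_safety_distribution=True,
    )
    both = FailureMode.COMPETING_OBJECTIVES | FailureMode.MISMATCHED_GENERALIZATION
    print(diagnose(example) == both)
```

An empty result is worth flagging in its own right: a finding that neither lens explains is a prompt to look more closely, not a reason to drop it.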

The automated-attack lesson — without the payloads

The other landmark read is Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023), the GCG paper. Read it strictly as a defender: the result that matters is not any particular string but the shape of the finding. Attacks can be found automatically by optimization rather than human cleverness, and a suffix optimized against open-weight models can transfer to other models, including closed ones it was never optimized against.

The defensive takeaway

If attacks are machine-discoverable and transferable, then "we patched the ones we found" is not a defense: the space of possible attacks is far too large to enumerate, and an attack found against one model may carry over to others. This is the argument for treating robustness as a property of the whole stack, not of the safety-tuning layer alone. Hold that thought; Day 23 makes it concrete.
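A toy illustration of why patching known inputs does not scale. The sketch below assumes the crudest possible patch, an exact-match blocklist keyed on a hash of the input. The class name and the placeholder strings are hypothetical and stand in for real attack text, which is deliberately not shown. Production filters are more sophisticated than this, but they face the same asymmetry: the defender enumerates, while the attacker's search generates new variants.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable identifier for an exact input string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ExactMatchBlocklist:
    """Hypothetical 'patch what we found' safeguard: remembers exact inputs."""

    def __init__(self):
        self._seen = set()

    def patch(self, known_bad: str) -> None:
        # Record an input that was previously observed to succeed.
        self._seen.add(fingerprint(known_bad))

    def blocks(self, candidate: str) -> bool:
        # Only inputs identical to a recorded one are caught.
        return fingerprint(candidate) in self._seen


blocklist = ExactMatchBlocklist()
blocklist.patch("PLACEHOLDER-ATTACK-A")

print(blocklist.blocks("PLACEHOLDER-ATTACK-A"))   # True: the patched input
print(blocklist.blocks("PLACEHOLDER-ATTACK-A "))  # False: one added space
```

The second check misses because a single-character change produces a different fingerprint. An automated search can produce such variants faster than any list can be extended, which is the case for layering defenses rather than enumerating attacks.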

Three attack classes to carry into the week

You'll meet these again, so anchor them now: direct jailbreaks (the user crafts the abusive input themselves), prompt injection (instructions arrive through content the model processes — tomorrow's topic, and the scary one for agents), and multimodal evasion (the abusive instruction hides in an image, audio, or other non-text channel a text-only filter never sees). Each maps to different defensive layers, which is exactly why a single safeguard can't cover all three.
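One way to see why a single safeguard leaves gaps is to record which channel each attack class arrives through and compare that with the channels a given safeguard inspects. The sketch below is an assumption-laden simplification: the channel names, the mapping, and the example filter scope are hypothetical, and inspecting a channel is necessary for catching an attack there, not sufficient.

```python
from enum import Enum


class Channel(Enum):
    USER_TEXT = "text the user types"
    PROCESSED_CONTENT = "documents, web pages, or tool output the model reads"
    NON_TEXT = "image, audio, or other non-text input"


# Entry channel for each attack class, following the definitions above.
ATTACK_CLASS_CHANNEL = {
    "direct jailbreak": Channel.USER_TEXT,
    "prompt injection": Channel.PROCESSED_CONTENT,
    "multimodal evasion": Channel.NON_TEXT,
}

# Hypothetical safeguard: a filter that inspects only the user's typed text.
USER_TEXT_FILTER_SCOPE = {Channel.USER_TEXT}


def uncovered(attack_channels, inspected_channels):
    """Attack classes whose entry channel the safeguard never inspects."""
    return sorted(
        name
        for name, channel in attack_channels.items()
        if channel not in inspected_channels
    )


print(uncovered(ATTACK_CLASS_CHANNEL, USER_TEXT_FILTER_SCOPE))
# ['multimodal evasion', 'prompt injection']
```

Widening the inspected set shrinks the uncovered list, but only by adding layers that look at different channels, which is the structural argument for defense-in-depth.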

Your work today

Read the Mechanism, Not the Recipe

~60-minute foundation

Read §2–3 of Jailbroken: How Does LLM Safety Training Fail? For every example they discuss, label it competing objectives, mismatched generalization, or both — ignore any specific strings, focus on the category.
Read §1–2 of Universal and Transferable Adversarial Attacks on Aligned Language Models as a defender. Note the two properties that make this a systemic problem: automated discovery and transferability. Do not reproduce payloads.
In a notebook, list the three attack classes (direct jailbreak, prompt injection, multimodal evasion) and write one sentence on which defensive layer you'd expect to catch each.

The expert move

A novice sees a jailbreak as a clever trick to collect. An expert sees it as a diagnosis: this input won because helpfulness out-pulled harmlessness, or because the input fell outside the safety distribution. The altitude jump is from cataloguing attacks to explaining them — because an explanation generalizes to attacks you haven't seen, and a catalogue doesn't.

Say this in an interview: "I think about jailbreaks through two lenses — competing objectives and mismatched generalization. That tells me safety tuning is necessary but not sufficient: it sets a default, not a guarantee. So I design for defense-in-depth and assume any single layer, including the model's own training, can be bypassed."

Today's Takeaways

Safety training sets a default, not a guarantee — a jailbreak moves the model off it.
Two root causes: competing objectives (safety out-pulled) and mismatched generalization (safety never fires).
Attacks can be discovered automatically and can transfer across models, so "patch what we found" is not a defense.
Three classes to track: direct jailbreak, prompt injection, multimodal evasion — each needs a different layer.